Data Import

Load raw data (from https://www.kaggle.com/ludobenistant/hr-analytics), add factors, save.

HR <- read_csv("HR_comma_sep.csv", col_types=cols(
  satisfaction_level = col_double(),
  last_evaluation = col_double(),
  number_project = col_integer(),
  average_montly_hours = col_integer(),
  time_spend_company = col_integer(),
  Work_accident = col_integer(),
  left = col_integer(),
  promotion_last_5years = col_integer(),
  sales = col_character(),
  salary = col_character()
))
HR <- plyr::rename(HR, replace=c("satisfaction_level"="Satis", "last_evaluation"="Eval",
            "number_project"="NumProj", "average_montly_hours"="MonHrs",
            "time_spend_company"="Tenure", "Work_accident"="Accdt01", 
            "left"="Left01", "promotion_last_5years"="Promo5yr01",
            "sales"="Dept", "salary"="Salary"))

HR$Salary <- factor(HR$Salary, levels=c("low", "medium", "high"), ordered=TRUE)
HR$Dept <- factor(HR$Dept)

HR$Accdt <- factor(HR$Accdt01 == 1)
HR$Promo5yr <- factor(HR$Promo5yr01 == 1)
HR$Left <- factor(HR$Left01==1)

set.seed(42)
HR$location <- factor(sapply(HR$Dept, location))
write_csv(HR, "HR.csv")

Data Summary

Dimensional Summary

summary(HR)
##      Satis             Eval           NumProj          MonHrs     
##  Min.   :0.0900   Min.   :0.3600   Min.   :2.000   Min.   : 96.0  
##  1st Qu.:0.4400   1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0  
##  Median :0.6400   Median :0.7200   Median :4.000   Median :200.0  
##  Mean   :0.6128   Mean   :0.7161   Mean   :3.803   Mean   :201.1  
##  3rd Qu.:0.8200   3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0  
##  Max.   :1.0000   Max.   :1.0000   Max.   :7.000   Max.   :310.0  
##                                                                   
##      Tenure          Accdt01           Left01         Promo5yr01     
##  Min.   : 2.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 3.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median : 3.000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   : 3.498   Mean   :0.1446   Mean   :0.2381   Mean   :0.02127  
##  3rd Qu.: 4.000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :10.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##                                                                      
##           Dept         Salary       Accdt        Promo5yr    
##  sales      :4140   low   :7316   FALSE:12830   FALSE:14680  
##  technical  :2720   medium:6446   TRUE : 2169   TRUE :  319  
##  support    :2229   high  :1237                              
##  IT         :1227                                            
##  product_mng: 902                                            
##  marketing  : 858                                            
##  (Other)    :2923                                            
##     Left                location   
##  FALSE:11428   Brisbane     :1801  
##  TRUE : 3571   London       :8925  
##                San Francisco:4273  
##                                    
##                                    
##                                    
## 

Pairwise Scatter Plots and Distributions for Staying vs Leaving Staff

  • Most dimensions show significantly different distributions for Staying vs Leaving staff
  • Bimodal distributions for Leaving staff are evident for evaluation, number of projects and monthly hours worked
  • Clustering for Leaving staff is evident comparing: satisfaction, evaluation and monthly hours worked; tenure across many dimensions; and between some departments
plot.new()
png(filename="hr-scatter.png", res=300, width = 3000, height = 3000)
my_colors <- brewer.pal(3, "Set2")
samp.sz <- 3750
pct <- round(100*samp.sz/nrow(HR))
data <- HR[sample(nrow(HR), samp.sz),]
scatterplotMatrix(~Satis+Eval+NumProj+MonHrs+Tenure+Accdt01+Promo5yr01+Dept+Salary|Left01, data=data, reg.line="", smoother="", col=my_colors , smoother.args=list(col="grey") , pch=c(3,4), legend.plot=FALSE, main=paste("Scatter Plot Pairs for HR Turnover (sampling ",pct,"% of data)", sep=""))
par(xpd=TRUE, cex=0.7)
Ly <- 1.16 # for inline HTML
Ly <- 1.05 # for PNG
legend(x=0.91, y=Ly, c("Stay", "Leave"), col=my_colors, pch=c(3,4), horiz=TRUE)
dev.off()

Discussion

Pairwise scatterplots and distributions suggest multi-modal distributions and significant clustering between dimensions for the subset of staff that leave. Add features to classify dimensions for:

  • evaluation
  • satisfaction
  • tenure
  • number of projects
  • monthly hours worked

Pairwise Numeric Correlations

Correlations between all pairs of numeric dimensions are investigated over the full dataset, and within the Leaving and Staying subsets.

Note that three dimensions are binary 0/1 values and need careful interpretation for correlations:

  • Left01 has the value 1 for leave and 0 for stay

  • Promo5yr01 has the value 1 for a promotion within the last 5 years and 0 for none

  • Accdt01 has the value 1 for accident and 0 for none

P-values (and optionally confidence intervals) for all pairs are calculated with the following formula. Correlations not passing the significance test (p < 0.01) will be marked with an ✕ in the matrix.

cor.mtest <- function(mat, conf.level = 0.95){
  mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    diag(lowCI.mat) <- diag(uppCI.mat) <- 1
    for(i in 1:(n-1)){
        for(j in (i+1):n){
            tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
            p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
            lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
            uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
        }
    }
    return(list(p.mat, lowCI.mat, uppCI.mat))
}

Overall Correlations (Within the Full Dataset)

Leaving (where 0=stay and 1=leave) is correlated with:

  • decreasing satisfaction level (-0.39) (negative correlates with decreasing satisfaction)

  • no work accidents (-0.15) (leaving correlates with fewer work accidents)

  • no promotion (-0.06) (leaving correlates with no promotion in the last 5 years)

Leaving is positively correlated with:

  • tenure (0.14) (longer tenure corelates with leaving)

  • average hours worked per month (0.07) and number of projects worked (0.02)

Other significant correlations include:

  • satisfaction level is most correlated with last evaluation (0.11), and most negatively with number of projects (-0.14) and tenure (-0.10)

  • last evaluation is positively correlated with number of projects (0.35), average monthly hours (0.34) and longer tenure (0.13)

  • tenure correlates with the number of projects (0.20), monthly hours worked (0.13), and receeving a promotion (0.07)

  • number of projects correlates with monthly hours worked (0.42)

DF <- HR[1:8]
M <- cor(DF)
pval99 <- cor.mtest(DF,0.99)

corrplot(M, p.mat = pval99[[1]], sig.level=0.01, method = 'pie', order ="hclust", addrect=2,
         tl.col="black", tl.cex = 1, tl.offset = 0.1, tl.srt = 45)

Discussion

The correlations between leaving and decreasing levels of satisfaction, longer tenure, longer hours of work per month, and low levels of promotion are perhaps not surprising but are important and should be investigated further to try and determine any causes.

The relationship between accidents at work, increasing job satisfaction and lower rates of leaving could also be investigated further as it seems counter-intuitive.

Gather additional data on:

  • The reasons for leaving

  • The basis of job satisfaction and tenure

  • The effectiveness of evaluation procedures

  • The reasons for (and impact of) persistent over-work or overloading of projects

This data could be gathered through methods including surveys, interviews, exit interviews and project post-mortems.

Correlations for Staying Employees

  • For staff who stay, the strongest correlation is a negative one between job satisfaction and tenure (-0.17)

  • Job satisfaction is also negatively correlated with the number of projects worked (-0.09)

  • Job satisfaction is positively correlated with evaluation (0.09) and monthly hours worked (0.06)

  • Evaluations are positively correlated with monthly hours worked (0.09), job satisfaction (0.09) and the number of projects worked (0.04)

  • Tenure is positively correlated with promotion (0.09) and projects worked (0.08)

DF <- HR[HR$Left==FALSE,c(1:6,8)]
M <- cor(DF)
pval99 <- cor.mtest(DF,0.99)
corrplot(M, p.mat = pval99[[1]], sig.level=0.01, method = 'pie', order ="hclust", addrect=2,
         tl.col="black", tl.cex = 1, tl.offset = 0.1, tl.srt = 45)

Discussion

Surprisingly, the strongest correlation in the data of staff who stay is decreasing job satisfaction levels with increasing tenure with the company. Unsurprisingly, promotions are correlated with increasing tenure and satisfaction levels are correlated with evaluations.

These staff seem to be staying if they are promoted, happier if they receive good evaluations, but unhappier the longer they stay.

  • More information should be gathered about tenure and job satisfaction levels

Correlations for Leaving Employees

For staff who leave:

  • Satisfaction is negatively correlated with the number of projects (-0.23) and the monthly hours worked (-0.08)

  • Satisfaction is positively correlated with tenure (0.44) and evaluation (0.18)

  • Evaluations are positively correlated with monthly hours worked (0.83), the number of projects worked (0.80), tenure (0.78) and job satisfaction (0.18)

  • Tenure is positively correlated with evaluation (0.78), monthly hours worked (0.66), number of projects worked (0.60) and job satisfaction (0.44)

DF <- HR[HR$Left==TRUE,c(1:6,8)]
M <- cor(DF)
pval99 <- cor.mtest(DF,0.99)
corrplot(M, p.mat = pval99[[1]], sig.level=0.01, method = 'pie', order ="hclust", addrect=2,
         tl.col="black", tl.cex = 1, tl.offset = 0.1, tl.srt = 45)

Discussion

For the staff who leave, there are some similarities with those who stay: tenure is also positively correlated with the monthly hours worked and with evaluations.

However, there are some significant differences. Tenure is positively correlated with job satisfaction and is negatively correlated with promotion — the opposite holds for those who stay.

These staff seem to be hard-working, happier the longer they stay and if they are recieving good evaluations, but leaving after not receiving promotions.

  • More information should be gathered about tenure and the opportunity for advancement

Exploratory Data Analysis

Career advancement

Many people leaving have been with the company at least 4 years but have not had a promotion in the last 5 years, despite working long hours.

with(HR, coplot(MonHrs ~ jitter(Tenure) |
                Promo5yr + Left))

These unpromoted, long-term workers who leave are receiving the highest evaluations.

with(HR, coplot(Eval ~ jitter(Tenure) |
                Promo5yr + Left))

These unpromoted, long-term workers who leave are also reporting high levels of satisfaction with the company.

with(HR, coplot(Satis ~ jitter(Tenure) |
                Promo5yr + Left))

Of the unpromoted workers putting in long hours who are leaving, some report the lowest levels of satisfaction, but more report very high levels of satisfaction.

The ones who are working the fewest hours report fairly low satisfaction.

with(HR, coplot(MonHrs ~ Satis |
                Promo5yr + Left))

The staff on the largest number of projects and who put in the longest hours, with no promotion in 5 years, tend to leave.

with(HR, coplot(MonHrs ~ jitter(NumProj) |
                Promo5yr + Left))

Of the unpromoted workers who left, many had both high evaluations and high levels of satisfaction.

with(HR, coplot(Eval ~ Satis |
                Promo5yr + Left))

Plot: evaluation, satisfaction | hours + Left evaluation, satisfaction | time with co + Left evaluation, satisfaction | NumProj + Left